Khan 2018 Mouse Lens Re-analysis¶

Lens changes with age notebook¶

Younger versus older lenses in a pairwise comparison¶

TMM normalization will not be used (IRS only)¶

Prepared by Phil Wilmarth, PSR Core, OHSU¶

December 12, 2023¶


  • Overview
  • Data loading
  • Young versus old comparison
  • Summary

Overview¶

This is a three plex experiment using 10-plex TMT with no reference channels per plex. Each plex is a set of biological replicates of developing mouse lens: E15, E18, P0, P3, P6, P9 (six ages, 3 day intervals). Data were downloaded from PRIDE (PXD006381) and processed with the PAW Pipeline (Wilmarth 2009) and IRS normalized using the average of each plex as a mock reference channel (Plubell 2017). This is the publication:

Khan, S.Y., Ali, M., Kabir, F., Renuse, S., Na, C.H., Talbot, C.C., Hackett, S.F. and Riazuddin, S.A., 2018. Proteome profiling of developing murine lens through mass spectrometry. Investigative Ophthalmology & Visual Science, 59(1), pp.100-107.

Experimental Groups:¶

Age Description Biological Replicates Pooled Mice Lenses
E15 Embryonic day 15 3 23 embryos
E18 Embryonic day 18 3 10 embryos
P0 Newborn 3 8 pups
P3 3-day old 3 8 pups
P6 6-day old 3 8 pups
P9 9-day old 3 8 pups

Sample Key:¶

Channel Plex 1 Plex 2 Plex 3
126C E15 E15 E15
127N E18 E18 E18
127C P0 P0 P0
128N P3 P3 P3
128C P6 P6 P6
129N P9 P9 P9
129C Mix? Mix? Mix?
130N unused unused unused
130C unused unused unused
131N unused unused unused

This notebook will compare lenses of different ages¶

There were three plexes with biological replicates of the 6 ages. There was no pooled internal reference standard. The plex average intensity for each protein was used as a mock reference intensity vector. IRS was used to put all three plexes (18 samples) on a common intensity scale. Statistical analysis is done with edgeR (Robinson 2010A) exact test after TMM normalization (Robinson 2010B).

We are doing a simple comparison between yonger lenses (E15 and E18) versus older lenses (P6 and P9) to put proteins into three categories: those that go up, those that stay the same, and those that go down. We will skip the TMM normalization here to see how that changes the testing.

References¶

Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.

Plubell, D.L., Wilmarth, P.A., Zhao, Y., Fenton, A.M., Minnier, J., Reddy, A.P., Klimek, J., Yang, X., David, L.L. and Pamir, N., 2017. Extended multiplexing of tandem mass tags (TMT) labeling reveals age and high fat diet specific proteome changes in mouse epididymal adipose tissue. Molecular & Cellular Proteomics, 16(5), pp.873-890.

Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. bioinformatics, 26(1), pp.139-140.

Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), pp.1-9.


Load the necessary R libraries¶

In [1]:
# library imports
library(tidyverse)
library(scales)
library(limma)
library(edgeR)
library(psych)
Warning message:
“replacing previous import ‘lifecycle::last_warnings’ by ‘rlang::last_warnings’ when loading ‘tibble’”
── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: ‘scales’


The following object is masked from ‘package:purrr’:

    discard


The following object is masked from ‘package:readr’:

    col_factor



Attaching package: ‘psych’


The following objects are masked from ‘package:scales’:

    alpha, rescale


The following objects are masked from ‘package:ggplot2’:

    %+%, alpha


Define common functions for notebook use¶

In [2]:
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE, table = TRUE, title = "TMM Normalized data") {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # color - color vector for boxplots
        # plot - flag to show boxplot or not
        # table - flag to print norm factor tables or not
        # title - main title for boxplots
        # returns a dataframe with normalized intensities
    
    # compute and print "Sample loading" normalization factors
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
    if(table == TRUE) {
        cat("\nLibrary size factors:\n", 
            sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
    }
    
    # compute and print TMM normalization factors
    tmm_facs <- 1/y$samples$norm.factors
    if(table == TRUE) {
        cat("\nTrimmed mean of M-values (TMM) factors:\n", 
            sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
    }
    
    # compute and print the final correction factors
    norm_facs <- lib_facs * tmm_facs
    if(table == TRUE) {
        cat("\nCombined (lib size and TMM) normalization factors:\n", 
            sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))
    }

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = title)
    }
    tmt_tmm
}

# ============ runs an edgeR exact pairwise test ========================
pairwise_test <- function(delist, pair, p.value) {
    # runs an edgeR exact test and returns results
    et <- exactTest(delist, pair = pair)
    tt <- topTags(et, n = Inf, sort.by = "none")
    
    # print top tags
    cat("\n")
    print(topTags(et)$table, digits = 4)
    cat("\n")
    
    # count candidates
    print(summary(decideTestsDGE(et, p.value = p.value)))

    # see candidates on an MA plot
    plotMD(et, p.value = p.value)
    abline(h = c(-1, 1), col = "black")
    
    # check the p-value distribution
    pval <- ggplot(tt$table, aes(PValue)) + 
      geom_histogram(bins = 100, fill = "white", color = "black") + 
      geom_hline(yintercept = mean(hist(et$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
      ggtitle("p-value distribution")
    print(pval) # this makes the plot show up
    
    tt # return top-tags object
}

# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags table from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# =============== log2_fold_changes ================================================
log2_fold_changes <- function(results, left, right, title) {
    # Makes log2 fold change plots by candidate
        # results - results data frame
        # left - lower FC log2 limit
        # right - upper FC log2 limit
        # title - plot title
    
    # see how many candidates by category
    cat("\n")
    print(results %>% count(candidate))
    cat("\n")

    # plot log2 fold-changes by category
    fc <- ggplot(results, aes(x = logFC, fill = candidate)) +
      geom_histogram(binwidth=0.1, color = "black") +
      facet_wrap(~candidate) +
      coord_cartesian(xlim = c(-left, right)) +
      ggtitle(title)
    print(fc)
}

# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
    identifier <- str_split(accession, "\\|", simplify = TRUE)
    identifier[,3]
#    identifier <- accession
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags, color = c("red", "blue")) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top ipregulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color_bars = c(rep(color[1], nleft), rep(color[2], nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", FDR: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1),
                       ", ", row$candidate)
        barplot(vec, col = color_bars, main = title,
                cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
    }    
}

# ============== CV function ===================================================
CV <- function(df) {
    # Computes CVs of data frame rows
        # df - data frame, 
        # returns vector of CVs (%)
    ave <- rowMeans(df)    # compute averages
    sd <- apply(df, 1, sd) # compute standard deviations
    cv <- 100 * sd / ave   # compute CVs in percent (last thing gets returned)
}

Read the IRS-normalized TMT intensity data¶

We checked the data quality in an earlier notebook, and it looked okay.

In [3]:
# load the IRS-normalized data and check the table
data_import <- read_tsv("labeled_grouped_protein_summary_TMT_9_IRS_normalized.txt", 
                        guess_max = 5854)

# "Filter" column flags contams and decoys
# "Missing" column flags proteins without reporter ion intensities (full sets missing)
# the table from pandas is sorted so the rows we want come first
data_all <- filter(data_import, is.na(Filter), is.na(Missing))
data_sl <- select(data_all, contains("SLNorm_"))
data_irs <- select(data_all, contains("IRSNorm_"))

# save gene names for edgeR so we can double check that results line up
accessions <- data_all$Accession

# see how many rows of data we have
length(accessions)

# set some larger plot dimensions
set_plot_dimensions(9, 9)
Parsed with column specification:
cols(
  .default = col_double(),
  Accession = col_character(),
  Identical = col_character(),
  Similar = col_character(),
  OtherLoci = col_character(),
  Filter = col_character(),
  Missing = col_character(),
  Description = col_character()
)

See spec(...) for full column specifications.

4249

Organize the data frame by biological group¶

In [4]:
# define the groups (after IRS)
E15_irs <- select(data_irs, contains("_E15_"))
E18_irs <- select(data_irs, contains("_E18_"))
P0_irs <- select(data_irs, contains("_P0_"))
P3_irs <- select(data_irs, contains("_P3_"))
P6_irs <- select(data_irs, contains("_P6_"))
P9_irs <- select(data_irs, contains("_P9_"))

# put groups together into a single data frame
tmt_data <- bind_cols(E15_irs, E18_irs, P0_irs, P3_irs, P6_irs, P9_irs)

# Set color vector:
colors <- c(rep("blue", 6), rep("red", 6), rep("green", 6))

# make indexes for young and old
young <- 1:6
middle <- 7:12
old <- 13:18

Load data into edgeR data structures¶

We are defining the groups that will be compared explicitly and using all the samples for variance estimates. We put the data into a data frame, grouped by condition. We defined some column indexes for each condition, set some colors for plotting. We will not run TMM normalization (relying on the grand total normalization only), check the final intensity distributions, and see how distinct the biological groups are in an MDS cluster view.

In [5]:
# get the biological sample data into a DGEList object
group <- c(rep('young', 6), rep('middle', 6), rep('old', 6))
y <- DGEList(counts = tmt_data, group = group, genes = accessions)

# run TMM normalization so we have the results in y
# y <- calcNormFactors(y)

tmt_tmm <- apply_tmm_factors(y, table = TRUE, color = colors, title = "IRS only (no TMM)")

# check the clustering
plotMDS(y, col = colors, main = "IRS only (no TMM)")
Library size factors:
 IRSNorm_E15_1 -> 1.031476
 IRSNorm_E15_2 -> 1.008094
 IRSNorm_E15_3 -> 0.986400
 IRSNorm_E18_1 -> 1.010526
 IRSNorm_E18_2 -> 1.012694
 IRSNorm_E18_3 -> 0.997802
 IRSNorm_P0_1 -> 0.998726
 IRSNorm_P0_2 -> 0.991580
 IRSNorm_P0_3 -> 0.999595
 IRSNorm_P3_1 -> 0.986234
 IRSNorm_P3_2 -> 0.999611
 IRSNorm_P3_3 -> 1.005907
 IRSNorm_P6_1 -> 0.980859
 IRSNorm_P6_2 -> 0.993474
 IRSNorm_P6_3 -> 1.011724
 IRSNorm_P9_1 -> 0.974078
 IRSNorm_P9_2 -> 1.004781
 IRSNorm_P9_3 -> 1.009580

Trimmed mean of M-values (TMM) factors:
 IRSNorm_E15_1 -> 1.000000
 IRSNorm_E15_2 -> 1.000000
 IRSNorm_E15_3 -> 1.000000
 IRSNorm_E18_1 -> 1.000000
 IRSNorm_E18_2 -> 1.000000
 IRSNorm_E18_3 -> 1.000000
 IRSNorm_P0_1 -> 1.000000
 IRSNorm_P0_2 -> 1.000000
 IRSNorm_P0_3 -> 1.000000
 IRSNorm_P3_1 -> 1.000000
 IRSNorm_P3_2 -> 1.000000
 IRSNorm_P3_3 -> 1.000000
 IRSNorm_P6_1 -> 1.000000
 IRSNorm_P6_2 -> 1.000000
 IRSNorm_P6_3 -> 1.000000
 IRSNorm_P9_1 -> 1.000000
 IRSNorm_P9_2 -> 1.000000
 IRSNorm_P9_3 -> 1.000000

Combined (lib size and TMM) normalization factors:
 IRSNorm_E15_1 -> 1.031476
 IRSNorm_E15_2 -> 1.008094
 IRSNorm_E15_3 -> 0.986400
 IRSNorm_E18_1 -> 1.010526
 IRSNorm_E18_2 -> 1.012694
 IRSNorm_E18_3 -> 0.997802
 IRSNorm_P0_1 -> 0.998726
 IRSNorm_P0_2 -> 0.991580
 IRSNorm_P0_3 -> 0.999595
 IRSNorm_P3_1 -> 0.986234
 IRSNorm_P3_2 -> 0.999611
 IRSNorm_P3_3 -> 1.005907
 IRSNorm_P6_1 -> 0.980859
 IRSNorm_P6_2 -> 0.993474
 IRSNorm_P6_3 -> 1.011724
 IRSNorm_P9_1 -> 0.974078
 IRSNorm_P9_2 -> 1.004781
 IRSNorm_P9_3 -> 1.009580

Groups separate left-to-right¶

This testing is mostly a categorization exercise. We are lumping E15 and E18 together to get a "younger" age and lumping the P6 and P9 together to get an "older" age. The pairwise comparison will have three categories: proteins that went up, proteins that did not go up or down, and proteins that went down with age. This may help up interpret the ANOVA results.

EdgeR statistical testing starts here¶

Compute the shared variance trend¶

One of the more powerful features of edgeR is computing variances across larger numbers of genes (proteins) to get more robust variance estimates for small replicate experiments. Here, we have 18 samples across all ages to use to improve the variance estimates and reduce false positive differential expression (DE) candidates. We have an edgeR estimateDisp function that does all of this and a visualization function to check the result.

We need to estimate the dispersion parameters before we can do the actual statistical testing. This only needs to be done once. Each exact test will take specified conditions and compare them using the normalization factors and dispersion estimates saved in the DGEList object y.

In [6]:
# compute dispersions and plot BCV
y <- estimateDisp(y)
plotBCV(y, main = "BCV plot of IRS/TMM normalized data")
Design matrix not provided. Switch to the classic mode.

Young lenses (n=6) versus older lenses (n=6)¶

Compare young (E15 and E18) lenses to old (P6 and P9) lenses.

We will specify the pair of interest for the exact test in edgeR and use the experiment-wide dispersion. The topTags table show us the most significant candidates (top 10). We can also summarize how many up and down regulated candidates we have at an FDR of 0.10. We save the test results in tt. We use a handy MA (mean-average) plotting function from limma to visualize the DE candidates and check the p-value distribution.

In [7]:
# use a 0.05 cutoff
tt = pairwise_test(y, c("young", "old"), 0.05)
                     genes  logFC logCPM    PValue       FDR
2188 sp|Q923D2|BLVRB_MOUSE -2.353  3.631 1.624e-28 6.900e-25
2338  sp|E9Q414|APOB_MOUSE -2.695  3.565 3.567e-28 7.578e-25
1009 sp|Q8JZK9|HMCS1_MOUSE -2.313  6.534 1.715e-27 2.429e-24
3012  sp|Q7TNP2|2AAB_MOUSE -2.384  3.058 2.620e-27 2.784e-24
1976  sp|Q9CZR8|EFTS_MOUSE -1.813  4.430 2.609e-26 2.217e-23
1535 sp|Q78IK4|MIC27_MOUSE -1.632  5.407 4.670e-26 3.266e-23
1259 sp|Q9DCD6|GBRAP_MOUSE -1.548  5.418 5.381e-26 3.266e-23
1840 sp|Q07417|ACADS_MOUSE -2.065  4.734 8.329e-26 4.424e-23
3887  sp|P36552|HEM6_MOUSE -2.805  1.753 2.016e-25 8.798e-23
981   sp|P23492|PNPH_MOUSE -1.567  6.481 2.071e-25 8.798e-23

       old-young
Down        3444
NotSig       661
Up           144

Without TMM, testing results are very different¶

We have a few up candidates (152 major lens proteins), a small number of unchanged proteins (684), and nearly all the proteins significantly down in relative abundance (3414). This illustrates how the behavior of the major lens proteins like the crystallins influence the lens proteome and why thinking about how to normalize the data is so critical. Using "grand total normalization only" like we are doing here might be a good way to find the major lens proteins. It is interesting to note that there are just 16 crystallins in the results list, a few intermediate filament and other structural proteins, and a handful of fiber cell plasma membrane proteins that are thought of as the major lens proteins (maybe 25-30 proteins). We have about 150 (120 to 180 depending on FDR cutoff) putative major lens proteins based on this comparison suggesting that there may be many other less abundant proteins that are critical for lens function that have not been detected in past studies.

The p-value distribution has a typical pattern of two distributions. There is a uniform (flat) distribution of p-values from 0 to 1 from chance associated with unchanged proteins. There is a sharp excess of small p-values that come from the (putative) DE candidates.

Compute averages, flag candidates, and save the pair-wise test results¶

We will add the statistical testing results (logFC, p-values, and FDR), condition intensity averages, and candidate status to the results data frame (which has the TMM-normalized data) and accumulate all comparisons into all_results.

We will make MA plots, scatter plots, and a volcano plot using ggplot2. We will also look at the distribution of log2 expression ratios separated by differential expression (DE) category. The Benjamini-Hochberg corrected edgeR p-values are used to categorize the DE candidates: no > 0.10 > low > 0.05 > med > 0.01 > high.

In [8]:
# get the results summary
results <- collect_results(tmt_tmm, tt$table, young, "young", old, "old")

# make column names unique by adding comparison (for the accumulated frame)
results_temp  <- results
colnames(results_temp) <- str_c(colnames(results), "_Y_O")

# accumulate testing results
all_results <- cbind(accessions, results_temp)

Count candidates and look at fold-change distributions¶

In [9]:
# count candidates and check log2 fold change distributions
log2_fold_changes(results, 4, 4, "Younger vs Older lenses logFC distributions by candidate")
  candidate    n
1      high 3302
2       med  286
3       low  131
4        no  530

Distributions of log2 ratios are very skewed¶

The results are dominated by down (in older lenses) candidates, many with larger fold changes.

Main summary plots¶

We have many comparisons to visualize, so we will use functions to generate a series of plots. We will make: an MA plot with candidates highlighted by color, faceted MA plots separated by candidate status, a scatter plot with candidates highlighted by color, faceted scatter plots separated by candidate status, and a volcano plot with candidates highlighted by color. The solid black lines in the MA and scatter plots are the 1-to-1 lines; the dotted lines are 2-fold change lines.

MA plots¶

In [10]:
# make MA plots
MA_plots(results, "ave_young", "ave_old", "Younger vs older lenses")

The non-candidates and candidates are a bit extreme¶

We see a pattern of increasing fold changes as we cut candidates by increasingly stricter FDR (Benjamini-Hochberg corrected p-values) thresholds. This is more easily seen in the facetted plots (the second panel). We do not have a very large set of unchanged proteins because of the normalization and composition of the lens proteome.

Scatter plots¶

In [11]:
# make scatter plots
scatter_plots(results, "ave_young", "ave_old", "Younger vs older lenses")

Volcano plot¶

In [12]:
# make a volcano plot
volcano_plot(results, "ave_young", "ave_old", "Younger vs older lenses")

Check some individual proteins¶

Since we have many candidates, we will look at the top 50 by p-value in each direction.

In [13]:
# look at the top candidates
set_plot_dimensions(9, 3.5)

n_left <- 6
n_right <- 6
top_tags <- 50
color <- c("blue", "green") 

plot_top_tags(results, n_left, n_right, top_tags, color)
set_plot_dimensions(9, 9)

Top candidates look convincing¶

Lumping E15 and E18 together for younger lenses and P6 and P9 together for older lenses reduces some of the age resolution in favor of a more robust comparison (n=6 instead of n=3). This time course is simpler than most. We have three major cell populations: epithelial cells (monolayer on the anterior face of the lens), differentiating cortex cells (the outer layers of the lens (elongating cells with nuclei)), and the mature fiber cells (bags of crystallins and other major lens proteins).

Summary¶

We demonstrated exactly why the TMM-normalization method was developed by using only a grand total intensity normalization here. Increased abundance of a small number of highly abundant proteins forces most of the proteome to lower abundance changes that were large enough to create huge numbers of statistically significant down candidates. Whether or not this is a useful way to analyze this data may depend on what questions are being asked. This might help find major lens proteins (this can be tested with the previously known lens proteins). It seems like TMM normalized data produces a more sensible pattern of candidates for probing developmentally regulated proteins. (But I do not have the expertise to know if that is true.)

This simplified way to look at a time course experiment is not very elegant. It relies on an assumption that protein changes will be increasing with age, decreasing with age, or relatively flat with age. There will be some lens growth during the developmental ages. If we think about the lens as a sphere, epithelial cells grow like the surface area (R^2) and mature fiber cells grow like the volume (R^3). Major lens proteins should go up across the ages, epithelial cell proteins should go down. Proteins from the cortex probably go down, but none of these cell populations are homogeneous. Epithelial cells at the start of differentiation will be more typical cells. Mature fiber cells are mostly major lens proteins with regular cell machinery and structures degraded. The cortex cells span this spectrum.

Save the all_results frame to TSV file¶

We collected the normalized data and statistical testing results and we can save that table to make a nice spreadsheet summary.

In [14]:
write.table(all_results, "Khan-2018_results_edgeR-exact_noTMM.txt", sep = "\t",
           row.names = FALSE, na =  " ")

Log the session information¶

In [15]:
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.16

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] psych_2.0.7     edgeR_3.24.3    limma_3.38.3    scales_1.1.1   
 [5] forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2     purrr_0.3.4    
 [9] readr_1.3.1     tidyr_1.1.1     tibble_3.0.3    ggplot2_3.3.2  
[13] tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5       locfit_1.5-9.4   lubridate_1.7.9  lattice_0.20-41 
 [5] assertthat_0.2.1 digest_0.6.25    IRdisplay_0.7.0  R6_2.4.1        
 [9] cellranger_1.1.0 repr_1.1.0       backports_1.1.8  reprex_0.3.0    
[13] evaluate_0.15    httr_1.4.2       pillar_1.4.6     rlang_1.0.2     
[17] uuid_0.1-4       readxl_1.3.1     rstudioapi_0.11  blob_1.2.1      
[21] labeling_0.3     splines_3.5.3    munsell_0.5.0    broom_0.7.0     
[25] compiler_3.5.3   modelr_0.1.8     pkgconfig_2.0.3  base64enc_0.1-3 
[29] mnormt_1.5-6     htmltools_0.4.0  tidyselect_1.1.0 crayon_1.3.4    
[33] dbplyr_1.4.4     withr_2.2.0      grid_3.5.3       nlme_3.1-148    
[37] jsonlite_1.7.0   gtable_0.3.0     lifecycle_0.2.0  DBI_1.1.0       
[41] magrittr_1.5     cli_3.3.0        stringi_1.4.6    farver_2.0.3    
[45] fs_1.5.0         xml2_1.3.2       ellipsis_0.3.2   generics_0.0.2  
[49] vctrs_0.4.1      IRkernel_1.1.1   tools_3.5.3      glue_1.6.2      
[53] hms_0.5.3        parallel_3.5.3   colorspace_1.4-1 rvest_0.3.6     
[57] pbdZMQ_0.3-3     haven_2.3.1     
In [ ]: